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Abstract 

Restricted Boltzmann Machines (RBM) have attracted a lot of attention of late, 
as one the principle building blocks of deep networks. Training RBMs remains 
problematic however, because of the intractibility of their partition function. The 
maximum likelihood gradient requires a very robust sampler which can accurately 
sample from the model despite the loss of ergodicity often incurred during learn- 
ing. While using Parallel Tempering in the negative phase of Stochastic Maximum 
LikeUhood (SML-PT) helps address the issue, it imposes a trade-off between com- 
putational complexity and high ergodicity, and requires careful hand-tuning of the 
temperatures. In this paper, we show that this trade-off is unnecessary. The choice 
of optimal temperatures can be automated by minimizing average return time (a 
concept first proposed by (Katzgraber et al., 2006)) while chains can be spawned 
dynamically, as needed, thus noinimizing the computational overhead. We show 
on a synthetic dataset, that this results in better likelihood scores. 



1 Introduction 

Restricted Boltzmann Machines (RBM) (Freund & Haussler, 1994; WelUng et al., 2005; Hinton 
et al., 2006) have become a model of choice for learning unsupervised features for use in deep feed- 
forward architectures (Hinton et al., 2006; Bengio, 2009) as well as for modeling complex, high- 
dimensional distributions (Welling et al., 2005; Taylor & Hinton, 2009; Larochelle et al., 2010). 
Their success can be explained in part through the bi-partite structure of their graphical model. 
Units are grouped into a visible layer v and a hidden layer h, prohibiting cormections within the 
same layer The use of latent variables affords RBMs a rich modeling capacity, while the conditional 
independence property yields a trivial inference procedure. 

RBMs are parametrized by an energy function E(v.h) which is converted to probability through 
the Boltzmann distribution, after marginalizing out the hidden units. The probability of a given 
configuration p(v) is thus given by p(v) = J2h exp(— E(v, h)), where Z is the partition function 
defined as Z = J2v h exp(— E(v, h)). 

Despite their popularity, direct learning of these models through maximum likelihood remains prob- 
lematic. The maximum Ukelihood gradient with respect to the parameters 6 of the model is: 

aiogp(v) , ,aE(v,h) ^ , _ ^_,5E(v-,h-) 

h v-,h- 

The first term is trivial to calculate and is referred to as the positive phase, as it raises the probability 
of training data. The second term or negative pliase is intractable in most appUcations of interest. 
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as it involves an expectation over p(v,h). Many learning algorithms have been proposed in the 
literature to address this issue: 

• Contrastive Divergence (CD) (Hinton, 1999; Hinton, 2002) replaces the expectation with a 
finite set of negative samples, which are obtained by running a short Markov chain initial- 
ized at positive training examples. This yields a biased, but low-variance gradient which 
has been shown to work well as a feature extractor for deep networks such as the Deep 
Belief Network (Hinton et al., 2006). 

• Stochastic Maximum Likelihood (SML) or Persistent Contrastive Divergence (PCD) 
(Younes, 1998; Tieleman, 2008) on the other hand, relies on a persistent Markov chain 
to sample the negative particles. The chain is run for a small number of steps between 
consecutive model updates, with the assumption that the Markov chain will stay close to 
its equilibrium distribution as the parameters evolve. Learning actually encourages this 
process, in what is called the "fast-weight effect" (Tieleman & Hinton, 2009). 

• Ratio Matching and Score Matching (Hyvarinen, 2005; Hyvarinen, 2007) avoid the issue 
of the partition function altogether by replacing maximum likelihood by another learning 
principle, based on matching the change in likelihood to that implied by the empirical 
distribution. 

(Marlin et al., 2010) recently compared these algorithms on a variety of tasks and found SML to 
be the most attractive method when taking computational complexity into account. Unfortunately, 
these results fail to address the main shortcomings of SML. First, it reUes on Gibbs sampling to 
extract negative samples: a poor choice when samphng from multi-modal distributions. Second, 
to guarantee convergence, the learning rate must be annealed throughout learning in order to offset 
the loss of ergodicity ' incurred by the Markov chain due to parameter updates (Younes, 1998; 
Desjardins et al., 2010). Using tempering in the negative phase of SML (Desjardins et al., 2010; 
Cho et al., 2010; Salakhutdinov, 2010b; Salakhutdinov, 2010a) appears to address these issues to 
some extent. By performing a random walk in the joint (configuration, temperature) space, negative 
particles can escape regions of high probability and travel between disconnected modes. Also, since 
high temperature chains are inherently more ergodic, the sampler as a whole exhibits better mixing 
and results in better convergence properties than traditional SML. 

Tempering is still no panacea however. Great care must be taken to select the set of temperatures 
T = {Ti,...,Tm;Ti <Ti <TMyi e [1,M],M gN} over which to run the simulation. Having 
too few or incorrectly spaced chains can result in high rejection ratios (tempered transition), low 
return rates (simulated tempering) or low swap rates between neighboring chains (parallel temper- 
ing), which all undermine the usefulness of the method. In this work, we show that the choice of 
T can be automated for parallel tempering, both in terms of optimal temperature spacing, as well 
as the number of chains to simulate. Our algorithm relies heavily on the work of (Katzgraber et al., 
2006), who were the first to show that optimal temperature spacing can be obtained by minimizing 
the average return time of particles imder simulation. 

The paper is organized as follows. We start with a brief review of SML, then explore the details 
behind SML with Parallel Tempering (SML-PT) as described in (Desjardins et al., 2010). Following 
this, we show how the algorithm of Katzgraber et al. can be adapted to the online gradient setting for 
use with SML-PT and show how chains can be created dynamically, so as to maintain a given level 
of ergodicity throughout training. We then proceed to show various results on a complex synthetic 
dataset. 

2 SML with Optimized Parallel Tempering 
2.1 ParaUel Tempered SML (SML-PT) 

We start with a very brief review of SML, which will serve mostly to anchor our notation. For 
details on the actual algorithm, we refer the interested reader to (Tieleman & Hinton, 2009; MarUn 
et al., 2010). RBMs are parametrized by 6* = {W, b, c}, where bj is the i-th hidden bias, Cj the 

'We use the term "ergodicity" rather loosely, to reflect the amount of time required for the states sampled 
by the Markov chain, to reflect the true expectation we wish to measure. 
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j-th visible bias and Wij is the weight connecting units hi to Vj. They belong to the family of log- 
linear models whose energy function is given by E(x) = — Y^f, 9k(j)k{^), where (pk are functions 
associated with each parameter 9^- In the case of RBMs, X = (v,h) and(/)(v,h) = (hv'^,h,v). 
For this family of model, the gradient of Equation 1 simplifies to: 

^^^1^ = Ep(h|v)[<^(v,h)]-Ep(.,h)[</'(v,h)]. (2) 

As was mentioned previously, SML approximates the gradient by drawing negative phase samples 
(i.e. to estimate the second expectation) from a persistent Markov chain, which attempts to track 
changes in the model. If we denote the state of this chain at timestep t as and the i-th training 
example as v^'\ then the stochastic gradient update follows (/)(v(*\h) — (/)(v^j., h^;.), where 
h = E[h|v = v^'^], and v^j. is obtained after k steps of alternating Gibbs starting from state 
and h-_^fe = E[h|v = v".^;- ]. 

Training an RBM using SML-PT maintains the positive phase as is. During the negative phase 
however, we create and sample from an an extended set of M persistent chains, {p^. (v,h)|i e 
[l,M],/3j > /3j ■^=> i < j}. Here each p^. (v, h) = '^^pi-^i^i^)) represents a smoothed 
version of the distribution we wish to sample from, with the inverse temperature /Sj = l/Tj S [0, 1] 
controlling the degree of smoothing. Distributions with small /3 values are easier to sample from as 
they exhibit greater ergodicity. 

After performing k Gibbs steps for each of the M intermediate distributions, cross-temperature 
state swaps are proposed between neighboring chains using a Metropolis-Hastings-based swap ac- 
ceptance criterion. If we denote by Xj the joint state (visible and hidden) of the i-th chain, the swap 
acceptance ratio Vi for swapping chains + 1) is given by: 

r. = max(l,^^^^±i%lM) (3) 



Although one might reduce variance by using free-energies to compute swap ratios, we prefer using 
energies as the above factorizes nicely into the following expression: 

n = exp((A - A+i) • (E(xi) - E(xi+i))), (4) 



While many swapping schedules are possible, we use the Deterministic Even Odd algorithm (DEO) 
(Lingenheil et al., 2009), described below. 



2.2 Return Time and Optimal Temperatures 



Conventional wisdom for choosing the optimal set T has relied on the "flat histogram" method 
which selects the parameters /3i such that the pair-wise swap ratio is constant and independent of 
the index i. Under certain conditions (such as when sampling from multi-variate Gaussian distribu- 
tions), this can lead to a geometric spacing of the temperature parameters (Neal, 1994). (Behrens 
et al., 2010) has recently shown that geometric spacing is actually optimal for a wider family of 
distributions characterized by Ep{'E{x)) = Ki/(3 + K2, where denotes the expectation over 
inverse temperature and Ki , K2 are arbitrary constants. 

Since this is clearly not the case for RBMs, we turn to the work of (Katzgraber et al., 2006) who 
propose a novel measure for optimizing T. Their algorithm directly maximizes the ergodicity of 
the sampler by minimizing the time taken for a particle to perform a round-trip between /3i and (Sm- 
This is defined as the average "return time" t,.(. The benefit of their method is striking: temperatures 
automatically pool around phase transitions, causing spikes in local exchange rates and maximizing 
the "flow" of particles in temperatvire space. 

The algorithm works as follows. For Ng sampling updates: 

• assign a label to each particle: those swapped into (3i are labeled as "up" particles. Simi- 
larly, any "up" particle swapped into Pm becomes a "down" particle. 
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• after each swap proposal, update the histograms n„ (i) ,nd{i), counting the number of "up" 
and "down" particles for the Markov chain associated with ^j. 

• define fup{i) = ;ri^+n;i(i) ' fraction of "up"-moving particles at By construction, 
notice that fupiPi) = 1 and fupi^M) = 0. /„p thus defines a probability distribution of 
"up" particles in the range /3m]- 

• The new inverse temperature parameters (5' are chosen as the ordered set which assigns 
equal probabiUty mass to each chain. This yields an curve which is linear in the chain 
index. 

The above procedure is applied iteratively, each time increasing Ng so as to fine-tune the /3/s. 
To monitor retum time, we can simply maintain a counter Ti for each particle x^, which is (1) 
incremented at every sampling iteration and (2) reset to whenever has label "down" and is 
swapped into A lower-bound for return time is then given by frt = Y^'iLo Ti- 

2.3 Optimizing T while Learning 

2.3.1 Online Beta Adaptation 

While the above algorithm exhibits the right properties, it is not very well suited to the context of 
learning. When training an RBM, the distribution we are sampling from is continuously changing. 
As such, one would expect the optimal set T to evolve over time. We also do not have the luxury of 
performing Ng sampling steps after each gradient update. 

Our solution is simple: the histograms n„ and nj, are updated using an exponential moving average, 
whose time constant is in the order of the return time frt- Using frt as the time constant is crucial 
as it allows us to maintain flow statistics at the proper timescale. If an "up" particle reaches the i-th 
chain, we update n„(i) as follows: 

nl+\t)^nlm-l/fl,) + l/f*^„ (5) 

where f*^ is the estimated return time at time t. 

Using the above, we can estimate the set of optimal inverse temperatures /3j'. Beta values are updated 
by performing a step in the direction of the optimal value: = /3| /x(/3- — /3|), where n is 
a learning rate on 13. The properties of (Katzgraber et al., 2006) naturally enforce the ordering 
constraint on the /3i's. 

2.3.2 Choosing M and /3m 

Another important point is that (Katzgraber et al., 2006) optimizes the set T while keeping the 
bounds /3i and /3m fixed. While /3i = 1 is a natural choice, we expect the optimal (3m to vary 
during leaming. For this reason, we err on the side of caution and use Bm = 0, relying on a chain 
spawning process to maintain sufficiently high swap rates between neighboring parallel chains. 
Spawning chains as required by the sampler should therefore result in increased stability, as well as 
computational savings. 

(Lingenheil et al., 2009) performed an interesting study where they compared the round trip rate 
1 / Trt to the average swap rate measured across all chains. They found that the DEO algorithm, 

which alternates between proposing swaps between chains + 1); Vcvcni} followed hy + 
1); Voddi}), gave rise to a concave function with a broad maximum around an average swap rate 

Our temperature adaptation therefore works in two phases: 

1. The algorithm of Katzgraber et. al is used to optimize {Pf, 1 <i < M}, for a fixed M. 

2. Periodically, a chain is spawned whenever f < fmin> a hyper-parameter of the algorithm. 

Empirically, we have observed increased stability when the index j of the new chain is selected such 
that j = argmax^(|/„p(i) — i e [1,M— 1]. To avoid a long burn-in period, we initiaUze 
the new chain with the state of the (j + l)-th chain and choose its inverse temperature as the mean 
{Pj + l3j+i)/2. A small but fixed bum-in period allows the system to adapt to the new configuration. 
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3 Results and Discussion 



We evaluate our adaptive SML-PT algorithm (SML-APT) on a complex, synthetic dataset. This 
dataset is heavily inspired from the one used in (Desjardins et al., 2010) and was specifically crafted 
to push the limits of the algorithm. 

It is an online dataset of 28x28 binary images, where each example is sampled from a mixture 
model with probability density function fx{x) = J2m=i '^mfv,,^ {x). Our dataset thus consists of 5 
mixture components whose weights Wm are sampled uniformly in the unit interval and normalized 
to one. Each mixture component Ym is itself a random 28x28 binary image, whose pixels are 
independent random variables having a probability p„i of being flipped. From the point of view of 
a sampler performing a random walk in image space, p.,„ is inversely proportional to the difficulty 
of finding the mode in question. The complexity of our synthetic dataset comes from our particular 
choice of w,,, and p„i. ^ Large Wm and small p„i lead to modes which are difficult to sample and in 
which a Gibbs sampler would tend to get trapped. Large p,„ values on the other hand will tend to 
intercept "down" moving particles and thus present a challenge for parallel tempering. 

Figure 1(a) compares the results of training a 10 hidden unit RBM, using standard SML, SML-PT 
with {10,20,50} parallel chains and our new SML-APT algorithm. We performed 10^ updates 
(followed by 2 • 10^ steps of sampling) with mini-batches of size 5 and tested learning rates in 
{10^'^, 10^''}, (3 learning rates in {10^^, lO^"', 10^^}. For each algorithm, we show the results 
for the best performing hyper-parameters, averaging over 5 different runs. Results are plotted with 
respect to computation time to show the relative computational cost of each algorithm. 




Figure 1: (a) Comparison of training likelihood as a function of time for standard SML, SML-PT 
with 10/20/50 chains and the proposed SML-APT (initialized with 10 chains). SML-APT adapts 
the temperature set T = {Ti, Tm] Ti < Ti < Tm} to minimize round trip time between chains 
Ti and T^/, and modifies the number of chains M to maintain a minimal average swap rate. The 
resulting sampler exhibits greater ergodicity and yields better likelihood scores than standard SML 
and SML-PT, without requiring a careful hand-tuning of T- (b) Average return time of each algo- 
rithm. SML-APT successfully minimizes this metric resulting in a return time similar to SML-PT 
10, while still outperforming SML-PT 50 in terms of likelihood. Errors bars represent standard error 
on the mean. 



As we can see, standard SML fails to learn anything meaningful: the Gibbs sampler is unable to 
cope with the loss in ergodicity and the model diverges. SML-PT on the other hand performs much 
better. Using more parallel chains in SML-PT consistently yields a better likelihood score, as well 
as reduced variance. This seems to confirm that using more parallel chains in SML-PT increases 
the ergodicity of the sampler Finally, SML-APT outperforms all other methods. As we will see 
in Figure 2, it does so using only 20 parallel chains. Unfortunately, the computational cost seems 
similar to 50 parallel chains. We hope this can be reduced to the same cost as SML-PT with 20 



= [0.3314, 0.2262, 0.0812, 0.0254, 0.3358] and p = [0.0001, 0.0137, 0.0215, 0.0223, 0.0544] 
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chains in the near future. Also interesting to note, while the variance of all methods increase with 
training time, SML-APT seems immune to this issue. 



We now compare the various metrics being optimized by our adaptive algorithm. Figure 1(b) shows 
the average return time for each of the algorithms. We can see that SML-APT achieves a return 
time which is comparable to SML-PT with 10 chains, while achieving a better likelihood score than 
SML-PT 50. 

We now select the best performing seeds for SML-PT with 50 chains and SML-APT, and show in 
Figure 2, the resulting fup{i) curves obtained at the end of training. 




B [index] B [index] 



(a) SML-APT (b) SML-PT 50 

Figure 2: Return time is minimized by tagging each particle with a label: "up" if the particle visited 
Ti more recently than Tm and "down" otherwise. Histograms n„(i) and nd{i) track the number of 
up/down particles at each temperature T^. Temperatures are modified such that the ratio fup{i) — 
nu{i) / {nu{i) + n(i{i)) is linear in the index i. (a) f^p curve obtained with SML-APT, as a function 
of temperature index (blue) and inverse temperature (red). SML-APT achieves a linear fup in the 
temperature index i. (b) Typical f^p curve obtained with SML-PT (here using 50 chains). /„p is not 
linear in the index i, which translates to larger return times as shown in Fig. 1(b). 

The blue curve plots fup as a function of beta index, while the red curves plots fup as a function of 
/?. We can see that SML-APT results in a more or less linear curve for fup{i), which is not the case 
for SML-PT. In Figure 3(a) we can see the effect on the pair-wise swap statistics r^. As reported in 
(Katzgraber et al., 2006), optimizing T to maintain a linear fup leads to temperatures pooling around 
the bottleneck. In comparison, SML-PT fails to capture this phenomenon regardless of whether it 
uses 20 or 50 parallel chains (figures 3(b)-3(c)). 




(a) SML-APT (b) SML-PT 20 (c) SML-PT 50 



Figure 3: Pairwise swap statistics obtained after 10'' updates. Minimizing return time causes SML- 
APT to pool temperatures around bottlenecks, achieving large swap rates (0.9) around bottenecks 
with relatively few chains. SML-PT on the other hand results in a much flatter distribution, requiring 
around 50 chains to reach swap rates close to 0.8. 
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Finally, Figure 4 shows the evolution of the inverse temperature parameters throughout learning. We 
can see that the position of the bottleneck in temperature space changes with learning. As such, a 
manual tuning of temperatures would be hopeless in achieving optimal return times. 



Figure 4: Graphical depiction of the set {f3i; i £ [1, M]}, of inverse temperature parameters used 
by SML-APT during learning. Temperatures pool around a bottleneck to minimize return time, 
while new chains are spawned to maintain a given average swap rate. Note that the last 20k updates 
actually correspond to a pure sampling phase (i.e. a learning rate of 0). 



4 Conclusion 

We have introduced a new adaptive training algorithm for RBMs, which we call Stochastic Maxi- 
mum Likelihood with Adaptive Parallel Tempering (SML-APT). It leverages the benefits of PT in 
the negative phase of SML, but adapts and spawns new temperatures so as to minimize return time. 
The resulting negative phase sampler thus exhibits greater ergodicity. Using a synthetic dataset, 
we have shown that this can directly translate to a better and more stable likelihood score. In the 
process, SML-APT also greatly reduces the number of hyper-parameters to tune: temperature set 
selection is not only automated, but optimal. The end-user is left with very few dials: a standard 
learning rate on /3i and a minimum average swap rate fmm below which to spawn. 

Much work still remains. In terms of computational cost, we would like a model trained with SML- 
APT and resulting in M chains, to always be upper-bounded by SML-PT initialized with AI chains. 
Obviously, the above experiments should also be repeated with larger RBMs on natural datasets, 
such as MNIST or Caltech Silhouettes. ^ 
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